Pbm: A new dataset for blog mining

Authors

Mehwish Aziz

Muhammad Rafi

Abstract

Text mining is becoming vital as Web 2.0 enables collaborative content creation and sharing, and researchers have a growing interest in text mining methods for discovering knowledge. Text mining researchers come from a variety of areas, such as natural language processing, computational linguistics, machine learning, and statistics. A typical text mining application involves preprocessing the text, stemming and lemmatization, tagging and annotation, deriving knowledge patterns, and evaluating and interpreting the results. There are numerous approaches to text mining tasks such as clustering, categorization, sentiment analysis, and summarization, and there is a growing need to standardize the evaluation of these tasks. One major component of standardization is providing standard datasets for these tasks. Although various standard datasets are available for traditional text mining tasks, datasets for blog mining are few and expensive. The blog, a new genre in Web 2.0, is a web user's digital diary with chronological entries; it contains a great deal of useful knowledge and thus offers many challenges and opportunities for text mining. In this paper, we report a new indigenous dataset for the Pakistani political blogosphere. The paper describes the process of data collection, organization, and standardization. We have used this dataset to carry out various text mining tasks for the blogosphere, such as blog search, political sentiment analysis and tracking, identification of influential bloggers, and clustering of blog posts. We wish to offer this dataset free of charge to others who aspire to pursue further work in this domain.


Similar resources

A new stochastic 3D seismic inversion using direct sequential simulation and co-simulation in a genetic algorithm framework

Stochastic seismic inversion is a family of inversion algorithms in which the inverse solution is carried out using geostatistical simulation. In this work, a new 3D stochastic seismic inversion was developed in the MATLAB programming software. The proposed inversion algorithm is an iterative procedure that uses the principle of cross-over genetic algorithms as the global optimization techniqu...


MINING FUZZY TEMPORAL ITEMSETS WITHIN VARIOUS TIME INTERVALS IN QUANTITATIVE DATASETS

This research aims at proposing a new method for discovering frequent temporal itemsets in continuous subsets of a dataset with quantitative transactions. It is important to note that although these temporal itemsets may have relatively high support or occurrence within particular time intervals, they do not necessarily get similar support across the whole dataset, which makes i...


Minimizing the Repeated Database Scan Using an Efficient Frequent Pattern Mining Algorithm in Web Usage Mining

Data mining is the process of discovering new patterns and knowledge from large datasets. Web mining is the application of data mining techniques to extract and mine useful knowledge and interesting patterns from the World Wide Web. Web data includes web documents, hyperlinks between documents, and usage logs of web sites. The web usage data captures the identity and origin of the web user along thei...


Understanding Travel Destinations From Structured Tourism Blogs

The increasing popularity of tourist generated content has created abundant opportunities for people to understand the opinions and experiences of prior tourists. However, till now no framework has been presented to automatically discover useful patterns from structured tourism blogs. In this paper, we present a method to mine the tourism information such as frequented spots and popular travel ...


Mining Access Patterns Using Clustering

Web usage mining is the application of data mining techniques to discover usage patterns from web data, in order to understand and better serve the needs of web-based applications. The aim of this paper is to discuss a proposed system that would perform clustering of user sessions extracted from the web logs. HTML links are extracted from these web logs for each user which constitutes the d...


Journal title:

CoRR

Volume abs/1201.2073

Pages -

Publication date 2011
